-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add spreadmissings
, the backend for column-wise @passmissing
#276
base: master
Are you sure you want to change the base?
Conversation
Okay I've thought about this more. I think we should still put this in 1.0. We should call it It's important that people be able to do A more coherent story about automatically creating |
nonmissingmask .&= .!ismissing.(v) | ||
end | ||
|
||
nonmissinginds = findall(nonmissingmask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrames._findall
will be more efficient
firstindices = eachindex(first(vecs)) | ||
if !all(x -> eachindex(x) == firstindices, vecs) | ||
err_str = "Indices of arguments do not match. Indices are " * | ||
join(string.eachindex.(vecs), ", ", " and ") * "." | ||
|
||
throw(ArgumentError(err_str)) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming this is an internal function I think we do not need it. DataFrames.jl enforces 1-based indexing.
|
||
nonmissingmask = fill(true, length(vecs[1])) | ||
for v in vecs | ||
nonmissingmask .&= .!ismissing.(v) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is inefficient. See https://github.com/JuliaData/DataFrames.jl/blob/main/src/abstractdataframe/abstractdataframe.jl#L758 for an efficient implementation.
|
||
Given a function `f`, wraps `f` so that a `view` selecting non-missing | ||
values of `AbstractVector` argument is applied before calling `f`. After `f` | ||
is called, if the result is a `AbstractVector` of the appropriate length, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define "appropriate length"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also comment exactly what happens otherwise.
|
||
return out | ||
else | ||
return res |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about this design. The "appropriate length" part seems fragile. I understand the intention, but it seems incorrect somehow.
I feel that the best design would be to have two separate functions:
- one leaving the result "as is"
- the other explicitly requiring the vector to be of matching length
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, another complication is how the spreading works. DataFrames.jl has some fairly complicated rules regarding spreading scalar values across columns.
Let's say you do
@rtransform @passmissing z = mean(:y)
this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My intention is to resolve such things with JuliaData/DataFrames.jl#2794.
The challenge here, though, is that you most likely want to combine operations having different rows that are skipped in a single call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed if the result is an AbstractVector
it should be required to have a length equal to the number of non-missing entries in the input. Relying on the exact size to choose the behavior is brittle (you could end up having the expected number of elements just by chance, and then your code would break). But relying on the type seems OK, it's very similar to broadcasting.
Let's say you do
@rtransform @passmissing z = mean(:y)
this would currently spread the mean across all rows. But maybe it's more logical to spread it across only non-missing rows. Unfortunately this would require re-implementing the DataFrames.jl spreading logic.
Well it would also be quite common for users to wish to fill all rows with the mean, including those with missing values. That happens for example if you want to create a group-level variable in a multi-level model.
@@ -283,6 +283,101 @@ macro byrow(args...) | |||
end | |||
|
|||
|
|||
struct SpreadMissings{F} <: Function |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the rationale behind SpreadMissings
name? It would be more natural for me to call it something like "complete cases" (or just SkipMissings
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was originally named for the "spreading" that happens when the shorter vector with the non-missing indices is spread out to match the length of the inputs. But I agree that "complete cases" is better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with the name SkipMissings
is that this type is quite different from SkipMissing
due to the "spreading" behavior. Actually, this function can be defined as a vectorized passmissing
when the wrapped function returns a vector; but when the function returns a scalar it leaves it as a scalar. This behavior is what seems to make the most sense for users: an alternative would be to "spread" scalars so that missing
is used in positions which are missing
in the input, but that doesn't sound super useful.
So it's hard to find a good name due to this hybrid behavior. I'd almost call it ArrayPassMissing
. Anyway for now it's internal.
But what matters is the API that DataFramesMeta (and probably DataFrames at some point) exposes to users. The approach adopted here is to make @passmissing
the go-to solution to handle missing values, which would work for mean(:x)
, cor(:x, :y)
, scale(:x)
, etc. It's appealing as most users wouldn't have to care about the subtle differences between passmissing
and skipmissing
and their potential variants (mean(skipmissing(:x))
would be an alternative to @passmissing mean(:x)
which makes a single pass over the data, but most users probably don't care a lot).
Another approach would be to require users to choose what they want to do, but it would be much more annoying to use:
@skipmissing
would useskipmissing∘f
for single arguments, or a new variant ofskipmissing
for multiple arguments which passes a view of their non-missing entries; both would leave the returned value as-is@passmissing
would usepassmissing(f)
forAbstractArray
inputs, orspreadmissing(f)
forAbstractArray
inputs, and would require the output to be anAbstractArray
with its length equal to the number of non-missing entries in the input (or possibly broadcast a scalar result to non-missing positions)
In summary: I still think it is a hard design choice what do to here as the choice is difficult. Why do you think we must have it for 1.0 release. I would feel that this is something we could add later (adding it in 1.1 release will be non breaking). |
Maybe this does not need to be in the 1.0 release. The main motivation at the moment is Side note, this infrastructure might also be useful for something like the following:
which has also been requested. But these are APIs that we haven't worked out 100% yet. And we should maybe push for 1.0 before deciding them. |
I agree we should solve it, but I assume that whatever we do will not break the other parts of DataFramesMeta.jl. Right? |
No, it does not require deprecating anything. |
I was thinking about it. I think Obervations:
The case |
You might be right, but I'm not 100% sure. Maybe a good compromise would be to require some explicit flags for spreading vector output. It's definitely more important to spread the result in some sort of futre But it seems like this doesn't need to be in 1.0 so we can table this discussion for now. |
I just ran into this issue today. We should definitely fix this, and should be easier now that you can do transformations with subsets. |
Given the JuliaData/DataFrames.jl#2794 PR we now have if
and if I understand the design of
(I use A limitation of
But I think it is OK as this is the most common use cases and I think we should optimize this case. For the case of
which:
If we agreed to this then maybe even such |
What do you mean by this?
I would find it annoying to have a function specific to DataFrames for this, as it would mean that either we still wouldn't provide any solution outside of DataFrames, or that users would have to know two different functions. Maybe we could have a single function which is defined outside of DataFrames, but that can also handle Also while I agree that |
I think
I definitely think this is the right direction, and might be the right implementation. However I think there are two issues to work through
should return
Implementing this feature via transformations on sub-dataframes means we don't have to worry about any of the Tables.jl-related stuff, which is nice. Or at least less of it. That is, we overwrite with |
@pdeffebach - where is the definition of
If the answer to both questions is yes then I think the proposed design is OK. (all comments by @pdeffebach to the @nalimilan - if @pdeffebach wants to define |
currently this does not work for column-wise transformation (
It should be able to be called everywhere, I think, including
calling
should enable users to get around this problem.
No, I think we should also support |
Yes in this PR we're talking about a DataFrames-specific API, but I think it would be good to keep in mind the broader picture to avoid inconsistencies with the rest of the ecosystem once we add a function to Missings.jl -- just like currently
@bkamins Right. But as I proposed above we could override the behavior of a more general function (say, Regarding |
Regarding
will try to expand the 11-element vector returned by |
But this is an issue with the current implementation, right? I think we can get around this with some combination of the following
|
I am talking about the design not about the implementation (which I think is a secondary issue and for sure can be resolved). What I ask for is for the clear rules how users can the operation be performed when called in |
Yes for Anyway maybe we could disallow using |
I think it is a good approach. Given this is an experimental feature I would limit the behavior to the cases we are sure are common and we are clear how they should work. |
@bkamins
This is the initial implementation of
spreadmissings
, which began as a PR in Missings.jl. It's fortunately much simpler than the original implementation since we don't have to work with non-vector inputs.I'm actually kind of worried about the utility of this function. It works very well with
cor
and other functions where we need pairs of functions. But it creates dangerous behavior for things likemean
.A user might write
f(x, y) = x .- mean(y)
. If they dospreadmissings(f)(x, y)
themean(y)
part of the function will actually be biased, examining only subsets ofy
wherex
is also non-missing.We need a much more heavy-handed name, like
nonmissingpairs
or something.Maybe this doesn't need to get merged before 1.0 considering that this API won't have the same name as
@passmissing
.See discourse question here also.